Add DGX Spark / GB10 fork to notable forks#207

Open
laudney wants to merge 1 commit into karpathy:master from mmonad:add-dgx-spark-fork

Conversation

@laudney laudney commented Mar 12, 2026

Summary

Adds our fork for the NVIDIA DGX Spark (GB10 chip, Blackwell sm_121a, 128GB unified memory).

The upstream code targets the H100 and uses FA3 kernels plus the kernels package, neither of which works on the GB10's sm_121a architecture. This fork gets autoresearch running out of the box on DGX Spark with the following changes:

  • Replace FA3 with flex_attention — FA3 kernels are Hopper-only (sm_90). flex_attention uses Triton-generated kernels that run on any architecture, including Blackwell sm_121a, with sliding-window support via create_block_mask
  • Switch PyTorch wheels from cu128 to cu130 — cu128 only supports up to sm_120. cu130 ships CUDA 13.0 with native sm_121a codegen and cuDNN 9.13, which gave us a 2.86x throughput improvement (49K → 140K tok/sec)
  • Python 3.10 → 3.13 — better compatibility with the cu130 torch wheels and other updated dependencies
  • Fix MFU calculation — report against GB10's 125 TFLOPS BF16 peak instead of H100's 989.5 TFLOPS (otherwise MFU reads ~3% when it's actually ~27%)
  • Drop kernels dependency — not needed once FA3 is replaced with flex_attention

We also benchmarked FA4 (the blake-snc sm120 fork) and cuDNN SDPA as alternative attention backends; both performed at parity or worse versus flex_attention on this hardware. The GB10 is memory-bandwidth limited at 273 GB/s, so most kernel-level optimizations don't move the needle.
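The MFU fix above is just a change of denominator. A quick sketch of the arithmetic, where the achieved FLOPs/sec figure is hypothetical (chosen to match the ~3% vs ~27% readings quoted above) and only the two peak numbers come from the PR:

```python
# BF16 peaks quoted in the PR description.
H100_PEAK_BF16 = 989.5e12   # FLOPs/sec, H100
GB10_PEAK_BF16 = 125e12     # FLOPs/sec, GB10

achieved_flops = 34e12      # hypothetical achieved model FLOPs/sec

# MFU = achieved model FLOPs per second / hardware peak.
mfu_vs_h100 = achieved_flops / H100_PEAK_BF16
mfu_vs_gb10 = achieved_flops / GB10_PEAK_BF16
print(f"vs H100 peak: {mfu_vs_h100:.1%}")  # vs H100 peak: 3.4%
print(f"vs GB10 peak: {mfu_vs_gb10:.1%}")  # vs GB10 peak: 27.2%
```

The same run reads ~8x lower if MFU is reported against the H100 peak, which is why the fork reports against the GB10's own 125 TFLOPS.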
